-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Support heterogeneous environment service #3097
Conversation
SparkSnail
commented
Nov 17, 2020
•
edited
Loading
edited
- Support remote
- Support azureml
- Support local, 11.25
- Support PAI 11.25
- doc 11.25
- Refactor code, fix bugs and pr ready for review 11.26
Merge master
merge master
merge master
merge master
merge master
merge master
merge master
Chinese translation (microsoft#2458)
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
merge master
Merge master
merge master
ts/nni_manager/training_service/reusable/channels/heterogenousCommandChannel.ts
Outdated
Show resolved
Hide resolved
}); | ||
await Promise.all(tasks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check if AML client sdk has the capability to do it in parallel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted aml change here.
ts/nni_manager/training_service/reusable/environments/heterogenousEnvironmentService.ts
Outdated
Show resolved
Hide resolved
merge master
@@ -96,7 +100,7 @@ export class OpenPaiEnvironmentService extends EnvironmentService { | |||
} | |||
|
|||
const getJobInfoRequest: request.Options = { | |||
uri: `${this.protocol}://${this.paiClusterConfig.host}/rest-server/api/v2/jobs?username=${this.paiClusterConfig.userName}`, | |||
uri: `${this.protocol}://${this.paiClusterConfig.host}/rest-server/api/v2/jobs/${this.paiClusterConfig.userName}~${environment.envId}`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change makes significant performance issue, and may trigger OpenPAI threshold to block API calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted to use jobs API.
import { GpuScheduler } from './gpuScheduler'; | ||
import { MountedStorageService } from './storages/mountedStorageService'; | ||
import { StorageService } from './storageService'; | ||
import { TrialDetail } from './trial'; | ||
import { AMLEnvironmentService } from './environments/amlEnvironmentService'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be load in config, not hard code all subclasses here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't get point, how to load config?
|
||
private readonly trials: Map<string, TrialDetail>; | ||
private readonly environments: Map<string, EnvironmentInformation>; | ||
// make public for ut | ||
public environmentServiceList: EnvironmentService[] = []; | ||
public commandChannelDict: Map<Channel, CommandChannel>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be in environmentService itself. All use of this map is related to environment service. Explained in above comments, in some cases, it may not be a single instance. Maintain such a map in this level is not necessary and not a good practice also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
@@ -62,6 +71,8 @@ class TrialDispatcher implements TrainingService { | |||
private enableGpuScheduler: boolean = false; | |||
// uses to save if user like to reuse environment | |||
private reuseEnvironment: boolean = true; | |||
private logCollection: string = ''; | |||
private environmentMaintenceLoopInterval: number = 5000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to read it from all services, and take max one. It can improve performance on local and other lightweight services.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
@@ -122,13 +132,7 @@ class TrialDispatcher implements TrainingService { | |||
|
|||
const trialId: string = uniqueString(5); | |||
|
|||
const environmentService = component.get<EnvironmentService>(EnvironmentService); | |||
let trialWorkingFolder: string = ""; | |||
if (environmentService.hasStorageService) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this logic now? It may break OpenPAI with shared folders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line 705 in trialDispatcher.ts.
for(const platform of platforms) { | ||
let environmentService: EnvironmentService; | ||
switch(platform) { | ||
case 'local': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is not necessary like this, it should be initialized like a factory pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, add environmentServiceFactory class to genreate environmentService.
throw new Error(`${platform} not supported!`); | ||
} | ||
if (!this.commandChannelDict.has(environmentService.getCommandChannelName)) { | ||
switch(environmentService.getCommandChannelName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not necessary, channel should be from service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
for (let index = 0; index < number; index++) { | ||
await this.requestEnvironment(); | ||
// Schedule a environment platform for environment | ||
private selectEnvironmentService(): EnvironmentService | undefined { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool. It can be in GPU scheduler easily in future.